1. INTRODUCTION

India's agricultural history goes back to the Indus Valley Civilization. Agriculture and related
operations contribute 17-18% of India's Gross Domestic Product and therefore have a significant effect on
the Indian economy. Agriculture plays an important part in India's social and economic system and is the
largest economic segment in terms of demographics [1]. Crop output prediction based on big data analysis
can help the government design crop insurance policies and supply chain operation policies [2]. It can also
help farmers by supplying yield predictions derived from past crop yield records, which helps them manage
risk [3].

The volume of data is rising exponentially, while the speed at which it can be analyzed is not keeping
pace. Examples of large data in agriculture include crop production, cultivated area, and crop yield. Since
the government systematically and continuously gathers data on crop production and yield, the resulting
dataset qualifies as big data, that is, real-world data that is very difficult to interpret [4]. Statistical
methods and data mining can be extended to distributed and parallel computing platforms to analyze big
data, but this often consumes considerable processing time and storage to accommodate vast data sets [5].
Data mining techniques play a crucial role in data analysis. Data mining is an interdisciplinary subfield of
computer science and analytics whose overall goal is to identify trends, patterns, and associations within
large data sets, using methods at the intersection of machine learning, database systems, and statistics [6].
It applies specialized statistical algorithms to segment the data and convert the extracted information into
an understandable structure that can be used to estimate the likelihood of future events [7]. There are two
kinds of learning approaches in data mining: unsupervised (clustering) and supervised (classification) [8].
Clustering is the practice of taking a set of "data points" and sorting them into separate "clusters"
according to a distance measure [9]. When grouping these data points, the goal is for data points in the
same cluster to be a small distance from each other, whereas data points in separate clusters should be far
from each other [10]. Cluster analysis groups data into well-formed classes, and well-formed clusters can
capture the natural structure of the data [11].

This paper aims to lessen the manual work of applying data mining algorithms by using different
Python modules. It uses Python-based libraries, tools, functions, and methods (numpy, pandas, seaborn,
K-means, and principal component analysis (PCA)) to quickly analyze, mine, and visualize the agriculture
dataset. The dataset is visualized using a distplot combined with a kernel density estimate (KDE) plot. The
K-means clustering technique is used to form clusters from the agricultural dataset. Compared to other
clustering algorithms, K-means is simple to implement and computationally efficient, which may explain
its popularity. The clusters obtained are visualized by reducing their dimensions using principal component
analysis. The remainder of this paper is organized as follows: section 2 explains the methodology for
visualizing and clustering the dataset, section 3 presents the results, and section 4 concludes with some
directions for future work.


2. RESEARCH METHODOLOGY 

This paper proposes a method to analyze agricultural data using data mining techniques. The
agriculture data used in the proposed work has been obtained from credible sources. The input dataset
consists of the following parameters: crop name, production (2006-2011), area (2006-2011), and yield
(2006-2011) [12]. The K-means clustering method is used to group crops with similar production, area, and
yield values [13]. A distplot combined with kernel density estimation (KDE) is used to visualize the
probability density of each continuous variable in the dataset, which can help improve prediction accuracy.
Principal component analysis is used to reduce the dimensionality of the dataset while preserving as much
of the original information as possible [14]. The optimum parameters for maximum output can be obtained
based on this analysis.

Clustering is the process of dividing a dataset into groups such that the entities in each cluster are
more similar to one another than to entities in other clusters. Clustering can reveal previously undetected
relationships in a dataset. In the proposed work, we use the K-means algorithm to cluster our agricultural
data. The K-means algorithm belongs to the family of prototype-based clustering methods. Prototype-based
methods seek to describe the data set to be clustered by a (usually small) set of prototypes, in particular
point prototypes, which are simply points in the data space [15]. Each prototype is intended to capture the
distribution of a group of data points based on a notion of similarity to the prototype or closeness to its
position, which may be influenced by prototype-specific size and shape parameters [16]. Our goal is to
group the records of the dataset by the similarity of their characteristics, which can be accomplished with
the K-means algorithm, summarised in the six steps [17] shown in Figure 1.

Figure 1. Steps for applying K-means clustering
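The K-means iteration can be illustrated with a minimal NumPy sketch of the standard algorithm (Lloyd's algorithm). It may not correspond one-to-one to the six steps of Figure 1. The function name `kmeans_sketch`, the toy data, and the choice of the first k samples as initial centroids are assumptions made for illustration; library implementations typically use random or k-means++ initialization.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100):
    """Plain K-means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    X = np.asarray(X, dtype=float)
    # Simplifying assumption: the first k samples serve as initial centroids.
    centroids = X[:k].copy()

    def assign(c):
        # Squared Euclidean distance from every sample to every centroid,
        # then the index of the nearest centroid for each sample.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(centroids), centroids


# Toy data (made up for illustration): two well-separated groups in 2-D.
X = np.array([[1.0, 1.0], [8.0, 8.0], [1.2, 0.8],
              [8.2, 7.9], [0.9, 1.1], [7.9, 8.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The two alternating operations, assigning samples to the nearest centroid and recomputing centroids as cluster means, are repeated until the centroids stop changing.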


Measuring similarity between objects: similarity is defined as the opposite of distance, and a
commonly used distance for clustering samples with continuous features is the squared Euclidean distance
between two points p and q in m-dimensional space [18].


$$d(p,q)^2 = \sum_{i=1}^{m} (p_i - q_i)^2 = \lVert p - q \rVert_2^2 \tag{1}$$


Note that the index i in the preceding equation refers to the i-th dimension (feature column) of
sample points p and q. Based on this Euclidean distance metric, the K-means algorithm can be described as a
simple optimization problem: an iterative approach to minimizing the within-cluster sum of squares (WCSS)
[19], which is often also called cluster inertia.

$$\mathrm{WCSS} = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert p^{(i)} - \mu^{(j)} \rVert_2^2 \tag{2}$$


where $p^{(i)}$ is the i-th sample, $\mu^{(j)}$ is the centroid of cluster j, and $w^{(i,j)}$ is equal to 1
if the sample $p^{(i)}$ is in cluster j and 0 otherwise. One disadvantage of this clustering algorithm is
that the number of clusters, k, must be specified a priori. An inappropriate choice of k may result in poor
clustering performance. For any unsupervised algorithm, determining the optimal number of clusters into
which the data may be grouped is a fundamental step. One of the most common methods for estimating this
optimum k is the elbow method [20]. We now demonstrate the method using the K-means implementation in the
sklearn Python library.
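As a small worked example of equations (1) and (2), the following sketch computes the squared Euclidean distance between two points and the WCSS of a hand-labelled toy dataset. The function names and all numeric values are assumptions chosen so the arithmetic can be checked by hand.

```python
import numpy as np


def squared_euclidean(p, q):
    """Equation (1): squared Euclidean distance between two points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(((p - q) ** 2).sum())


def wcss(X, labels, centroids):
    """Equation (2): sum of squared distances of samples to their own centroid."""
    # Indexing the centroids by label plays the role of the indicator w:
    # only the centroid of the cluster a sample belongs to contributes.
    return float(((X - centroids[labels]) ** 2).sum())


# Toy example (values made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 2.0]])

print(squared_euclidean([0, 0], [3, 4]))  # 25.0
print(wcss(X, labels, centroids))         # 10.0  (1 + 1 + 4 + 4)
```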


2.1. Creating and visualizing the data 

Data visualization is the representation of data values in a pictorial format. It helps in attaining a
better understanding of the data and in drawing sound conclusions from it. Data visualization plays a
crucial role in any data analysis [21]. It helps to recognize which variables are important and which
variables can influence our prediction model. When preparing any machine learning (ML) model, we first
have to discover which features are significant and how they can affect the result. This can be done by
analyzing the data through data visualization.

- Python seaborn module: Several data visualization modules in Python depend on the Matplotlib library.
  Seaborn is one of these modules; it provides functions with better efficiency and plotting features. With
  seaborn, data can be presented in different visualizations, and features can be added to enhance the
  pictorial representation [22].

- Distplot: A distplot, or distribution plot, shows how the values of a variable are distributed. It
  combines two plots: matplotlib's hist function and seaborn's kdeplot(). The seaborn distplot can thus be
  combined with a kernel density estimate (KDE) plot to estimate the probability density of a continuous
  variable across its values.

- KDE plot: A plot that depicts the probability density function of continuous data variables, estimated
  non-parametrically; it can be drawn for a single variable or for multiple variables together [23].

- Heatmaps: Heatmaps are one of the important built-in seaborn functions for data exploration and
  visualization. Seaborn heatmaps summarize the data in the form of a color-coded matrix [24]. We have used
  a heatmap to find correlations in the dataset.

Figure 2 describes the code for creating and visualizing the dataset; it includes four blocks representing
the code for importing the libraries, loading the dataset, plotting the distplot, and plotting the heatmap,
respectively.


    # Block 1: import the libraries
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Block 2: load the dataset
    crop_df = pd.read_csv('/content/drive/My Drive/datafile (2).csv')

    # Block 3: distplot with a KDE curve for each column
    # (assumes every column of crop_df is numeric)
    plt.figure(figsize=(10, 50))
    for i in range(len(crop_df.columns)):
        plt.subplot(15, 1, i + 1)
        # Any further keyword arguments are illegible in the source and are omitted.
        sns.distplot(crop_df[crop_df.columns[i]],
                     kde_kws={"color": "b", "lw": 3, "label": "KDE"})
        plt.title(crop_df.columns[i])

    # Block 4: correlation heatmap
    correlations = crop_df.corr()
    f, ax = plt.subplots(figsize=(20, 20))
    sns.heatmap(correlations, annot=True)


Figure 2. Steps involved in visualizing the dataset 
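To make concrete what the KDE curve in a distplot represents, the following is a minimal sketch of a Gaussian kernel density estimate written directly in NumPy. The function name `gaussian_kde_sketch`, the sample values, and the hand-picked bandwidth are assumptions for illustration; seaborn chooses the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Hypothetical yield values (made up for illustration).
values = np.array([120.0, 125.0, 131.0, 134.0, 150.0])
grid = np.linspace(60.0, 210.0, 1501)
density = gaussian_kde_sketch(values, grid, bandwidth=8.0)

# A probability density integrates to 1; check with a simple Riemann sum.
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 3))
```

Note that seaborn 0.11 and later deprecate distplot in favor of histplot (with `kde=True`) and displot, so code written against a recent seaborn release may need one of those functions instead.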


2.2. Finding number of clusters K by elbow method 

The elbow method is perhaps the best-known means of estimating the optimum number of clusters [25],
although it is also somewhat naive. The within-cluster sum of squares (WCSS) is computed for various values
of k, and the k at which the decrease in WCSS begins to level off is chosen. This point is visible as an
elbow in the plot of WCSS versus k. The term within-cluster sum of squares may sound complicated; Figure 3
breaks it down.

We need to scale the continuous features so that all features carry equal weight. Scikit-learn's
StandardScaler is used for this purpose. We initialize K-means for each value of k and use the inertia
attribute to obtain the sum of squared distances of the samples to their nearest cluster center. This sum
tends to zero as k increases: if k is set to its maximal value n (where n is the number of samples), each
sample forms its own cluster and the sum of squared distances is zero. The code used to plot the sum of
squared distances against k is described in Figure 4, which depicts four blocks representing the code for
importing the libraries, scaling the dataset, initializing K-means for each k value, and applying the
elbow method, respectively. If the plot looks like an arm, the ideal k is at the elbow of the arm. We
implement this in Python using the sklearn library, calculating the WCSS for several values of k.





The squared error for each point is the square of the distance of
the point from its representation, i.e., its predicted cluster center.

Figure 3. Brief description of WCSS

Figure 4. Steps involved in finding the number of clusters (K) by elbow method
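The elbow procedure described in this section can be sketched as follows. This is an illustrative example on synthetic data, not the code of Figure 4: the three-group dataset, the range of k, and the `n_init` and `random_state` settings are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crop table: 3 groups of 50 points in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Standardize so that every feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for k = 1..8 and record the WCSS (the inertia_ attribute).
k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(model.inertia_)

for k, value in zip(k_values, wcss):
    print(k, round(value, 2))
```

Plotting `wcss` against `k_values` (for example with `plt.plot(k_values, wcss, "o-")`) gives an elbow curve. For this synthetic data the curve flattens after k = 3 by construction; for real data the elbow is often less sharp and its choice involves some judgment.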


2.3. Applying K-means and principal component analysis (PCA) 

In the code for applying the K-means algorithm, a K-means object is created and given the number of
clusters K obtained from the elbow method. In the next line, the fit method is called on the K-means object
with the scaled dataset crop_df_scaled as its argument. The labels_ attribute is then used to inspect the
cluster label assigned to each data point. The clusters identified by K-means can be made visible through
dimensionality reduction. PCA, an unsupervised machine learning algorithm, is an effective tool for
visualizing high-dimensional data in combination with K-means. It projects the data points into a
lower-dimensional space, reducing them to a few principal components that can be plotted [26]. Figure 5
describes the code for implementing PCA on the dataset; the blocks in this figure represent the code for
obtaining the principal components, creating a dataframe with two components, concatenating the labels to
the dataframe, and visualizing and interpreting the clusters.


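    # Block 1: obtain the principal components
    # (PCA is sklearn.decomposition.PCA; its import is not visible in the source)
    pca = PCA(n_components=2)
    principal_comp = pca.fit_transform(crop_df_scaled)
    principal_comp
    # Output: a two-column array of component scores (values illegible in the source)

    # Block 2: create a dataframe with the two components
    pca_df = pd.DataFrame(data=principal_comp, columns=['pca1', 'pca2'])
    pca_df.head()

    # Block 3: concatenate the cluster labels to the dataframe
    # (this line is illegible in the source and is reconstructed from the text;
    # `labels` is assumed to hold the K-means labels_ array)
    pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis=1)
    pca_df.head()
    # Output columns: pca1, pca2, cluster

    # Block 4: visualize the clusters
    plt.figure(figsize=(10, 10))
    ax = sns.scatterplot(x="pca1", y="pca2", hue="cluster",
                         data=pca_df, palette=['red', 'green', 'blue', 'black'])
    plt.show()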





Figure 5. Steps involved in applying PCA on the dataset 
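To show what the projection onto two principal components computes, and how much of the data's variance it retains, the following is a minimal NumPy sketch of PCA via the singular value decomposition. The function name `pca_sketch` and the synthetic data are assumptions for illustration; scikit-learn's PCA likewise centers the data and uses an SVD, so its scores should agree with these up to sign.

```python
import numpy as np


def pca_sketch(X, n_components=2):
    """Project centred data onto its top principal directions via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    # Share of the total variance captured by each kept component.
    explained_ratio = (s ** 2)[:n_components] / (s ** 2).sum()
    return scores, explained_ratio


# Synthetic 4-feature data whose variation is mostly 2-dimensional.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 2))
mixing = np.array([[1.0, 0.5, 0.0, 2.0],
                   [0.0, 1.0, 1.5, -1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(60, 4))

scores, explained_ratio = pca_sketch(X)
print(scores.shape)           # (60, 2)
print(explained_ratio.sum())  # close to 1 for this nearly 2-D data
```

The explained variance ratio is a useful check before interpreting a two-dimensional cluster plot: if the two components capture only a small share of the variance, clusters that are well separated in the full feature space may overlap in the plot.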
3. RESULTS AND DISCUSSION
3.1. Visualizing the dataset 

The dataset must be visualized before the K-means algorithm is applied to it. The results of data
visualization are shown in Figure 6 (see Appendix) and Figure 7. Figure 6 depicts the distplot combined
with the KDE plot for each variable of the dataset. Figure 7 depicts the heatmap, which represents the
dataset in a 2-dimensional format for finding correlations among the variables.





Figure 7. Heatmap for the given dataset 


3.2. Clustering 

To determine the value of K (the number of clusters), the elbow method is applied to the dataset. The
outcome is shown in Figure 8, which plots the within-cluster sum of squares for a range of values of K. The
optimum number of clusters is determined by choosing the "elbow" value of K, i.e., the point at which the
WCSS starts to decrease linearly. On this basis, we take the number of clusters to be 4 for the given
dataset. Table 1 presents the result of the K-means clustering algorithm. Figure 9 depicts the clusters
obtained, represented in reduced dimensions using principal component analysis. Crops are commonly
selected for study on the basis of their economic significance. The agricultural planning process, however,
requires yield estimates for many crops. In this context, using data availability as the main criterion, 54
crops have been selected for this work. Crops were chosen only when appropriate data samples were
available for the 6-year range under review (2006-11).

The K-means clustering algorithm forms 4 clusters. Cluster 0 represents the crops having medium
production, high area, and medium-low yield. Cluster 1 represents the crops having low production, low
area, and medium yield. Cluster 2 represents the crops having high production, medium area, and high
yield. Cluster 3 represents the crops having medium-low production, medium-low area, and low yield. In
summary, the present work uses the distplot combined with the kernel density estimate plot and the heatmap
for visualization, the elbow method for finding the optimal number of clusters K, the K-means clustering
algorithm for forming clusters from the dataset, and principal component analysis for representing the
clusters in reduced dimensions. The crop data can be analysed using these methods, and the optimum
parameters for crop production can be calculated.

Figure 8. WCSS vs K plot (elbow method)
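A per-cluster summary of production, area, and yield can be derived by grouping the rows by cluster label. The sketch below assumes that such a summary consists of per-cluster minima and maxima, which the paper does not state explicitly; the column names and all values are made up for illustration and are not the paper's dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not the paper's dataset.
df = pd.DataFrame({
    "crop": ["A", "B", "C", "D", "E"],
    "production": [210.0, 250.0, 100.0, 115.0, 1500.0],
    "area": [170.0, 200.0, 72.0, 75.0, 130.0],
    "yield": [120.0, 135.0, 140.0, 150.0, 1200.0],
    "cluster": [0, 0, 1, 1, 2],
})

# Minimum and maximum of each measure within each cluster.
ranges = df.groupby("cluster")[["production", "area", "yield"]].agg(["min", "max"])
# Crops belonging to each cluster, as comma-separated strings.
members = df.groupby("cluster")["crop"].agg(", ".join)

print(ranges)
print(members)
```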


Table 1. Clusters obtained from the K-means algorithm to represent crops as per production, area, and yield

Cluster 0 (medium production, high area, and medium-low yield)
  Crops: Rice, Maize, Soyabean, Dry ginger, Arecanut, Garlic, Total Fruits & Vegetables, Potato, Onion,
    Banana
  Production range: 199.59-299.95
  Area range: 168.56-213.63
  Yield range: 119.57-140.70

Cluster 1 (low production, low area, and medium yield)
  Crops: Bajra, Ragi, Small millets, Barley, Sesamum, Rapeseed & Mustard, Linseed, Safflower, Niger seed,
    Mesta, Jute & Mesta, Sannhamp, Dry chilies, Cardamom, Coriander, Sweet potato, Tobacco
  Production range: 97.27-120.54
  Area range: 71.24-76.68
  Yield range: 134.70-154.15

Cluster 2 (high production, medium area, and high yield)
  Crops: Total Spices
  Production range: 1427.70-1790.60
  Area range: 121.30-136.60
  Yield range: 1172.10-1310.80

Cluster 3 (medium-low production, medium-low area, and low yield)
  Crops: Total Foodgrains, Wheat, Jowar, Coarse Cereals, Cereals, Gram, Arhar, Other Pulse, Total Non-Food
    grains, Total Oilseeds, Groundnut, Castor seed, Sunflower, Nine Oilseeds, Coconut, Cottonseed, Total
    Fibers, Cotton (lint), Jute, Tea, Coffee, Rubber, Black pepper, Turmeric, Tapioca, Sugarcane
  Production range: 150.048-174.83
  Area range: 121.25-126.68
  Yield range: 123.97-137.51

Figure 9. Result of PCA


4. CONCLUSIONS AND FUTURE WORK 

In developing countries such as India, agriculture is one of the most significant application fields.
In agriculture, the use of information technologies can improve decision-making and help farmers be more
productive. Data mining plays a key role in decision-making on several matters relating to the agriculture
sector. This paper discusses the role of data mining from the perspective of the agriculture field.
Different data mining techniques are applied to the input data and can be used to determine the process
that yields the best output. To obtain the optimum parameters for achieving higher crop yield, the present
study used data mining techniques such as K-means clustering and principal component analysis. Through
this paper, an attempt is made to lessen the manual work of applying data mining algorithms by using
different Python modules. Extending the present work to evaluate soil, climate conditions, demand data,
and other variables to improve crop yield is left as future work.

APPENDIX

Figure 6. Distplot combined with KDE plot for the given dataset